Specifying and Implementing Data Infrastructures Enabling Data Intensive Science
Abstract
Examples from psycholinguistics, a humanities discipline, show that data intensive research is changing all scientific disciplines dramatically. Data intensive sciences pose unprecedented challenges in data management and processing. A survey in Europe showed clearly that most research departments are not prepared for this step and that the methods used to manage, curate and process data are inefficient and too costly. The Research Data Alliance (RDA), a bottom-up, global and cross-disciplinary initiative, was established to accelerate the change in data practice. After only two years RDA has produced its first concrete results, which now have to demonstrate their potential. In particular, infrastructure builders are asked to act as early adopters of RDA results. The European Commission and its member states have taken serious steps to establish an ecosystem of research infrastructures and e-Infrastructures that anticipates the challenges of the data deluge and will enable broad uptake of the paradigm of data intensive science. Research organisations have recognised these challenges as well and have taken first steps to adapt their structures. However, we need to understand that we are in a phase of gigantic change, which implies that the measures currently being taken should be interpreted as tests on the way to new, solid and sustainable structures.

1. Enabling Data Intensive Sciences

Quite a number of scientific institutes have been data oriented for a long time. For instance, most of the research at the experimental and theoretical institutes of the Max Planck Society was based on data. Even an institute belonging to the humanities section of the Max Planck Society, such as our former affiliation, the Institute for Psycholinguistics [1], was oriented from the start towards analysing recordings of speech, eye movements and gestures, detecting meaningful patterns, and building models to simulate speech perception.
In physics institutes (fusion, astronomy, etc.), of course, much larger volumes of data were being processed, and they can look back on a much longer history of data oriented work. It was the book "The Fourth Paradigm: Data-Intensive Scientific Discovery" [2], edited by Tony Hey and colleagues, that introduced "data intensive science" as the fourth paradigm of scientific discovery, referring to a talk given by J. Gray. The book drew much attention to the concept behind this new paradigm. Gray distinguishes four paradigms that co-exist today: (1) empirical science describing natural phenomena, (2) theoretical science using models to achieve generalizations, (3) computational science simulating complex phenomena and (4) data exploration unifying theory, experiment and simulation. Indeed, we can observe that science is changing in so far as finding meaningful patterns in data sets is becoming an essential approach. Ever more powerful and numerous sensors, improved network connections, more powerful and numerous computers and more advanced algorithms are key pillars of this development.

The "Riding the Wave" report [3], produced by a High Level Expert Group of the European Commission (EC), was one of the documents that summarized the specific data challenges and opportunities. It requested actions by the EC to enable data intensive science for a large number of researchers, and not only for those who have sufficient funding to curate and integrate all the data and software needed to make use of it.

We see a number of trends, which we can summarize as follows:
- An increasing number of research disciplines have adopted data intensive methods owing to new technological and methodological possibilities. During the last decades these changes were particularly pronounced in biological and neurological disciplines.
- The amount of data and its complexity in terms of creation contexts, data types and relations are growing rapidly.
- The Internet allows us to offer data via the web to be re-used by others. This enables us to combine data sets in new ways across institutional, national and discipline borders.

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2015), Obninsk, Russia,
Similar resources
EPOS: A Novel Use of CERIF for Data-intensive Science
One of the key aspects of the approaching data-intensive science era is integration of data through interoperability of systems providing data products or visualization and processing services. Far from being simple, interoperability requires robust and scalable e-infrastructures capable of supporting it. In this work we present the case of EPOS, a plan for data integration in the field of Eart...
Towards Implementing Virtual Data Infrastructures - a Case Study with iRODS
Scientists demand easy-to-use, scalable and flexible infrastructures for sharing, managing and processing their data spread over multiple resources accessible via different technologies and interfaces. In our previous work, we developed the conceptual framework VISPA for addressing these requirements. This paper provides a case study assessing the integrated Rule-Oriented Data System (iRODS) fo...
Modeling Users' Data Usage Experiences from Scientific Literature
In the new data-intensive science paradigm, data infrastructures have been designed and built to collect, archive, publish, and analyze scientific data for a variety of users. Little attention, however, has been paid to users of these data infrastructures. This study endeavors to improve our understanding of these users’ data usage models through a content analysis of publications related to a ...
rSYBL: a Framework for Specifying and Controlling Cloud Services Elasticity
Cloud applications can benefit from the on-demand capacity of cloud infrastructures, which offer computing and data resources with diverse capabilities, pricing and quality models. However, state-of-the-art tools mainly enable the user to specify "if-then-else" policies concerning resource usage and size, resulting in a cumbersome specification process that lacks expressiveness for enabling the...
GRESS - a Grid Replica Selection Service
Grid technologies and infrastructures facilitate distributed resource sharing and coordination in dynamic, heterogeneous, multi-institutional environments. A replica catalog is a Grid component that keeps replica locations of data objects and provides location transparency to data access. Replica selection is of great importance to data-intensive scientific computing targeted by many data Grid ...
Publication date: 2015